Abstract

We find the asymptotic total variation distance between two distributions on configurations of m balls in n labeled bins: in the first, each ball is placed in a bin uniformly at random; in the second, k balls are planted in an arbitrary but fixed arrangement and the remaining m - k balls are placed uniformly at random.

1 Introduction 

Planted distributions arise in several contexts in the study of random struc- 
tures and algorithms. They are used as a means of studying complicated 
conditional distributions: instead of studying a random structure condi- 
tioned on the presence of a substructure, we can instead plant a copy of 
the substructure and add random elements around it. One example is the
random graph: we can plant a subgraph, say a k-clique or a Hamiltonian
cycle, in an empty graph on n vertices, then add random edges on top (see
[2], [3], [7]). Another class of examples is random satisfiability problems,
including Random k-SAT and random graph k-coloring. In the case of Random
k-SAT, instead of conditioning on there being a satisfying assignment
to a random formula, we can pick a random assignment uniformly, then
sample random clauses from the set of clauses satisfied by that assignment.
Planted Random k-SAT is thus guaranteed to have a satisfying assignment
and has been used to test algorithms and in the analysis of the satisfiability
threshold (see for example, pQ, [1]). 

In each of these models, a natural question is how close the planted distribution is to the conditional distribution or to the basic random distribution. The problem of statistically distinguishing a planted Hamiltonian cycle in a random graph was proposed by Klas Markström and considered by a group (including Svante Janson, Colin McDiarmid, Oliver Riordan and Joel Spencer) at the Discrete Probability program (Spring 2009) at Institut Mittag-Leffler. The original problem was to determine how many random edges are needed to "hide" a Hamiltonian cycle on n vertices by adding m - n random edges. The planted Hamiltonian cycle is hidden if the planted distribution and the standard random graph with m edges are statistically indistinguishable asymptotically, i.e. the total variation distance between the two distributions tends to 0 as n → ∞.

In this paper we consider a 'pure' version of the problem: a planted ver- 
sion of the balls-and-bins model. We determine the number of random balls 
needed to 'hide' an initial planted configuration of balls. We have two pri- 
mary motivations for studying this model: first, we can answer the question 
of the distinguishability of the planted distribution completely, finding the 
distinguishability threshold and total variation limit for any given planting; 
second, many discrete probability problems can be reduced to an instance of 
balls-and-bins, and so a balls-and-bins result may prove to be a useful tool. 

The standard balls-and-bins model involves throwing m balls into n bins, 
with each ball independently thrown into a uniformly chosen bin. The stan- 
dard model induces a probability distribution on configurations of m balls 
in n bins which we will call the ST(ANDARD) distribution. In the planted 
version of the model, we begin with a fixed arrangement of balls already in 
bins: perhaps one ball planted in each bin, or k balls planted in the first 
bin and none in any others. If we have planted k balls, we then throw the
remaining m - k balls into bins uniformly at random. The planted model
induces its own distribution on configurations of m balls in n bins which we
will call the PL(ANTED) distribution. The PL distribution depends on
the particular initial planting. Our main question is: how large must m be
as a function of n so that the total variation distance between ST and PL,
$\|ST - PL\|_{TV}$, tends to 0? In other words, how many random balls do we
need to throw in bins to "forget" our initial planting?
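
The two models just described can be sketched in a few lines of Python. This is only an illustration: the function names `sample_st` and `sample_pl` are hypothetical and not part of the paper, and a configuration is represented as a list of bin counts.

```python
import random
from collections import Counter


def sample_st(m, n, rng):
    """Standard model: throw m balls independently into n uniformly chosen bins."""
    counts = Counter(rng.randrange(n) for _ in range(m))
    return [counts.get(i, 0) for i in range(n)]


def sample_pl(planting, m, rng):
    """Planted model: start from `planting` (k balls), then throw m - k random balls."""
    n, k = len(planting), sum(planting)
    if k > m:
        raise ValueError("the planting contains more than m balls")
    extra = sample_st(m - k, n, rng)
    return [a + r for a, r in zip(planting, extra)]
```

Both functions return a configuration with exactly m balls; under `sample_pl` every bin holds at least its planted count.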

2 Preliminaries 

Here we introduce our notation for the paper. A configuration Z of m balls in n bins consists of a list of non-negative integers $\{z_i\}_{i=1}^n$, $\sum_i z_i = m$, where $z_i$ is the number of balls in bin i. An initial planting A of k balls in n bins consists of $\{a_i\}_{i=1}^n$, $\sum_i a_i = k$, where $a_i$ is the number of balls planted in bin i. In the labeled case, these lists are ordered, while in the unlabeled case the lists are unordered. ST(Z) is the probability of configuration Z under the ST distribution, and $PL_A(Z)$ or PL(Z) is the probability under the planted distribution with initial planting A (with the subscript omitted if the initial planting is clear from the context). We measure the distinguishability of the two distributions by their total variation distance:

\[ \|ST - PL\|_{TV} = \frac{1}{2} \sum_{Z} |ST(Z) - PL(Z)| \tag{1} \]

where the sum is over all possible configurations Z of m balls in n bins. The total variation distance depends on m, n, and the initial planting A. We write $TV_A(m,n)$ for the total variation distance between ST and PL with m balls, n bins and initial planting A, or simply TV(m, n) if the initial planting is understood.
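
For very small m and n the total variation distance can be computed exactly by enumerating all configurations. The sketch below does this with exact rational arithmetic; the helper names (`compositions`, `st_prob`, `pl_prob`, `total_variation`) are illustrative assumptions, and the probabilities are the multinomial formulas implied by the two models.

```python
from fractions import Fraction
from math import factorial


def compositions(m, n):
    """Yield every configuration (z_1, ..., z_n) of m balls in n labeled bins."""
    if n == 1:
        yield (m,)
        return
    for first in range(m + 1):
        for rest in compositions(m - first, n - 1):
            yield (first,) + rest


def st_prob(z):
    """ST(Z): multinomial coefficient divided by n^m."""
    n, m = len(z), sum(z)
    ways = factorial(m)
    for zi in z:
        ways //= factorial(zi)
    return Fraction(ways, n ** m)


def pl_prob(z, planting):
    """PL(Z): the m - k random balls must produce the counts z_i - a_i."""
    rest = tuple(zi - ai for zi, ai in zip(z, planting))
    if min(rest) < 0:
        return Fraction(0)
    return st_prob(rest)


def total_variation(m, planting):
    """Half the sum of |ST(Z) - PL(Z)| over all configurations Z."""
    n = len(planting)
    return Fraction(1, 2) * sum(
        abs(st_prob(z) - pl_prob(z, planting)) for z in compositions(m, n)
    )
```

For example, with n = 2 bins and one ball planted in each, `total_variation(2, (1, 1))` is 1/2 and `total_variation(4, (1, 1))` is 1/8: the two extra random balls already hide part of the planting.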

We write 'with high probability' or 'whp' if an event holds with probability → 1 as n → ∞ (other authors sometimes use 'aas' or 'asymptotically almost surely'). We use the standard asymptotic notation $O(\cdot)$, $o(\cdot)$, $\Theta(\cdot)$, and $\omega(\cdot)$. We often combine the asymptotic notation with a statement about probability: g(n) = O(f(n)) whp means that there exists a constant K so that Pr[g(n) ≤ K f(n)] → 1 as n → ∞. g(n) = o(f(n)) whp means that for any constant c > 0, Pr[g(n) > c f(n)] → 0 as n → ∞.



3 Main Result

Our main result divides the set of initial plantings into three regimes and characterizes the asymptotic behavior of TV(m, n) in each case. For a given initial planting A, we define V(A) as follows:

\[ V(A) = \frac{1}{n} \sum_{i=1}^n a_i^2 - \frac{k^2}{n^2} \tag{2} \]

We can view V(A) as the variance of the random variable A that takes value $a_i$, $i = 1, \dots, n$, with probability $\frac{1}{n}$. It is a characterization of how 'flat' or 'hilly' the initial planting is.

Theorem 1. Let A be an initial planting of k balls in n labeled bins, with $k \gg \sqrt{n}$. Then:

1. The Flat regime: $V(A) = o\left(\frac{k}{n^{3/2}}\right)$. Let $m \sim c k n^{1/2}$. Then

\[ TV(m,n) = 2\Phi\left(\frac{1}{2\sqrt{2}\,c}\right) - 1 + o(1) \]

where $\Phi(\cdot)$ is the standard normal distribution function.

2. The Hilly regime: $V(A) = \omega\left(\frac{k}{n^{3/2}}\right)$. Let $m \sim c V n^2$. Then

\[ TV(m,n) = 2\Phi\left(\frac{1}{2\sqrt{c}}\right) - 1 + o(1) \]

3. The Intermediate regime: $V(A) \sim \lambda \frac{k}{n^{3/2}}$ with $\lambda$ a constant. Let $m \sim c k n^{1/2}$. Then

\[ TV(m,n) = 2\Phi\left(\sqrt{\frac{\lambda}{4c} + \frac{1}{8c^2}}\right) - 1 + o(1) \]

We make a few remarks about the theorem. In each regime, as c → ∞, TV(m, n) → 0 and as c → 0, TV(m, n) → 1, so this scaling gives the correct threshold at which the two distributions become distinguishable. There is also a smooth transition between the three regimes.

Note that for some values of k, only Hilly arrangements exist. For example, for k = o(n), all initial arrangements are Hilly. Also, the Flat and Intermediate regimes are quite restrictive. For k = n, for example, $n - O(\sqrt{n})$ of the $a_i$'s are exactly 1 in both the Flat and Intermediate regimes. A randomly chosen initial planting will be Hilly with high probability.
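
As a numerical companion to the theorem, the sketch below computes V(A) exactly and evaluates the three limiting expressions, taken here to be $2\Phi(1/(2\sqrt{2}c)) - 1$ (Flat), $2\Phi(1/(2\sqrt{c})) - 1$ (Hilly) and $2\Phi(\sqrt{\lambda/(4c) + 1/(8c^2)}) - 1$ (Intermediate), which are the values obtained from the strategies of Section 4. The function names are hypothetical, and the code only evaluates the limits; it does not decide which regime a finite planting belongs to.

```python
from fractions import Fraction
from math import erf, sqrt


def planting_variance(planting):
    """V(A) = (1/n) * sum a_i^2 - k^2 / n^2, as an exact fraction."""
    n, k = len(planting), sum(planting)
    return Fraction(sum(a * a for a in planting), n) - Fraction(k * k, n * n)


def phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def tv_limit_flat(c):
    return 2.0 * phi(1.0 / (2.0 * sqrt(2.0) * c)) - 1.0


def tv_limit_hilly(c):
    return 2.0 * phi(1.0 / (2.0 * sqrt(c))) - 1.0


def tv_limit_intermediate(c, lam):
    return 2.0 * phi(sqrt(lam / (4.0 * c) + 1.0 / (8.0 * c * c))) - 1.0
```

Setting `lam = 0` in `tv_limit_intermediate` recovers `tv_limit_flat`, which is one way to see the smooth transition between the Flat and Intermediate regimes.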

Illustrative Special Cases of the Theorem. Two special cases of the theorem will be useful guides to what follows.

1. The pure flat initial planting, starting with 1 ball in each bin, or $a_i = 1$ for all i. In this case the theorem states that the distinguishability threshold occurs at $m = \Theta(n^{3/2})$.

2. A pure hilly planting, starting with k balls in the first bin, $a_1 = k$, $a_i = 0$, $i = 2, \dots, n$. In this case the theorem states that the distinguishability threshold occurs at $m = \Theta(k^2 n)$.

The remainder of the paper is devoted to the proof of Theorem 1 and is organized as follows: in Section 4 we prove the lower bounds on total variation distance, and introduce distinguishing statistics in each regime. In Section 5 we prove the upper bounds. Sections 6 to 8 are devoted to technical lemmas, and in Section 9 we give some concluding remarks.

4 Lower Bounds 

Consider the following game with the goal of distinguishing between the ST and PL distributions: one of the distributions is chosen at random with probability $\frac{1}{2}$ each and then a configuration Z is sampled from it. The player of the game sees only the configuration and must determine which distribution it came from. He wants a strategy that maximizes the probability of selecting the correct distribution. For example, in the case of planting all k balls in bin 1, one natural strategy would be to look at $z_1$, the number of balls that end up in bin 1. If $z_1$ is higher than some threshold, choose the PL distribution, otherwise choose the ST distribution.

Such a strategy gives a lower bound to TV(m, n) in the following way: via Bayes' formula we see that the optimal strategy would be to compute ST(Z) and PL(Z) and choose whichever is higher. If we call the probability of success using the optimal strategy $p^*$ then we have

\[ p^* = \frac{1}{2} \sum_{ST(Z) \ge PL(Z)} ST(Z) + \frac{1}{2} \sum_{PL(Z) > ST(Z)} PL(Z) \]

And so

\[ 2p^* - 1 = \sum_{ST(Z) \ge PL(Z)} ST(Z) + \sum_{PL(Z) > ST(Z)} PL(Z) - \sum_{Z} PL(Z) = \sum_{ST(Z) \ge PL(Z)} [ST(Z) - PL(Z)] = TV(m,n) \]

So given any strategy with probability of success p, we have $p^* \ge p$ and

\[ TV(m,n) \ge 2p - 1 \tag{3} \]

In this section we will give strategies in each of the three regimes, calculate their success probabilities and find the lower bounds for Theorem 1. The strategies are all similar to the strategy described above for the extreme case: we choose a statistic of the $z_i$'s and a cutoff value. If the statistic is above the cutoff we choose the specified distribution; if not, we choose the other. While these strategies give lower bounds immediately, in Section 5 we will analyze the optimal strategy in each regime and show that these simple strategies are in fact asymptotically optimal and thus give the correct asymptotic total variation distance.

The benefit of these strategies is that they are simpler and more descrip- 
tive than the optimal strategy of comparing PL(Z) to ST(Z): they tell us 
what feature of the planted distribution takes the longest to 'forget'. 
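
The bound TV(m, n) ≥ 2p - 1 can be checked exactly in the extreme case where all k balls are planted in bin 1. In that case the likelihood ratio depends on Z only through $z_1$, so it suffices to work with the law of $z_1$: Binomial(m, 1/n) under ST and k + Binomial(m - k, 1/n) under PL. The sketch below, with hypothetical function names, compares every cutoff strategy against the exact distance.

```python
from fractions import Fraction
from math import comb


def st_bin1(m, n):
    """Law of z_1 under ST: Binomial(m, 1/n), as exact fractions."""
    return [Fraction(comb(m, j) * (n - 1) ** (m - j), n ** m) for j in range(m + 1)]


def pl_bin1(m, n, k):
    """Law of z_1 when all k balls are planted in bin 1: k + Binomial(m - k, 1/n)."""
    return [Fraction(0)] * k + st_bin1(m - k, n)


def success_probability(m, n, k, cutoff):
    """Guess PL when z_1 >= cutoff and ST otherwise; each model has prior 1/2."""
    st, pl = st_bin1(m, n), pl_bin1(m, n, k)
    return Fraction(1, 2) * sum(st[:cutoff]) + Fraction(1, 2) * sum(pl[cutoff:])


def tv_bin1(m, n, k):
    """Total variation distance between the two laws of z_1."""
    st, pl = st_bin1(m, n), pl_bin1(m, n, k)
    return Fraction(1, 2) * sum(abs(s - p) for s, p in zip(st, pl))
```

Every cutoff gives a valid lower bound, and because the likelihood ratio is monotone in $z_1$ here, the best cutoff attains the distance exactly.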

4.0.1 Flat Regime 

We first describe the strategy in the first special case, one ball planted in each bin, then extend this to the general Flat regime. With one ball planted in each bin, a natural strategy would be to choose the ST distribution if any bin were empty and choose the PL distribution otherwise. This strategy would separate the two distributions up to $m \sim n \log n$, but for higher scalings of m every bin has at least one ball whp under the ST distribution so the statistic fails to differentiate the distributions. A better statistic is the number of pairs of balls in the same bin

\[ PAIRS(Z) = \sum_{i=1}^n \binom{z_i}{2} \]

To see why this separates the distributions, compare the first n balls under each distribution. Under the ST distribution, the expected number of pairs that end up in the same bin is $\binom{n}{2}\frac{1}{n} = \frac{n-1}{2}$, while under the PL distribution 0 pairs are in the same bin. The j-th ball, $n < j \le m$, is placed randomly in both ST and PL distributions and adds the same to the expectation of $\sum_{i=1}^n \binom{z_i}{2}$, and so the difference in the expectations remains precisely $\frac{n-1}{2}$. We can exploit this discrepancy to give a lower bound for the total variation distance. If we write

\[ q_i = \frac{n}{m}\left(z_i - \frac{m}{n}\right) \tag{4} \]

we get

\[ PAIRS(Z) = \frac{m^2}{2n^2} \sum_{i=1}^n q_i^2 + \frac{m^2}{2n} - \frac{m}{2} \]

It will be convenient later to use the scaled and shifted statistic $\sum_i q_i^2$ instead of PAIRS(Z).

For a general initial planting in the Flat regime we adjust the statistic by adding weights to the $q_i^2$'s. The weights are the $a_i$'s themselves and naturally arise from the analysis in Section 5.1. The statistic is:

\[ F_A(Z) = \sum_{i=1}^n a_i q_i^2 \]

From calculations in Section 8,

\[ E_{ST} F_A = \frac{kn}{m} + o(1) \]

\[ E_{PL} F_A = \frac{kn}{m} - \frac{k^2 n}{m^2} + o(1) \]

and

\[ var_{ST}(F_A) \sim var_{PL}(F_A) \sim \frac{2k^2 n}{m^2} \]

We work under our scaling $m \sim ckn^{1/2}$, and we set $\mu_F = \frac{kn}{m} - \frac{k^2 n}{2m^2}$ to be the average of the two means. Our strategy for distinguishing the two distributions is to choose the ST distribution if $F_A(Z) > \mu_F$ and choose the PL distribution if $F_A(Z) \le \mu_F$. We show in Section 6 that $F_A$ is asymptotically normal under each distribution. Since the distance of $\mu_F$ from each mean is

\[ \frac{k^2 n}{2m^2} \Big/ \sqrt{\frac{2k^2 n}{m^2}} \sim \frac{1}{2\sqrt{2}\,c} \]

standard deviations under the scaling $m \sim ckn^{1/2}$ (which is $m \sim cn^{3/2}$ when k = n), the asymptotic probability that $F_A(Z)$ is above or below $\mu_F$ in each case can be computed from the standard normal distribution function. Thus the success probability for this strategy is:

\[ p = \frac{1}{2}\Phi\left(\frac{1}{2\sqrt{2}\,c}\right) + \frac{1}{2}\Phi\left(\frac{1}{2\sqrt{2}\,c}\right) + o(1) = \Phi\left(\frac{1}{2\sqrt{2}\,c}\right) + o(1) \]

and so $TV(m,n) \ge 2\Phi\left(\frac{1}{2\sqrt{2}\,c}\right) - 1 + o(1)$ from (3), giving the lower bound for Theorem 1.
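
The relation between PAIRS(Z) and the scaled statistic can be verified on any configuration. Expanding $\binom{z_i}{2}$ with $z_i = \frac{m}{n}(1 + q_i)$ and $\sum_i q_i = 0$ gives $PAIRS(Z) = \frac{m^2}{2n^2}\sum_i q_i^2 + \frac{m^2}{2n} - \frac{m}{2}$; the sketch below checks this identity in exact arithmetic and also evaluates the weighted statistic $F_A(Z) = \sum_i a_i q_i^2$. The function names are illustrative only.

```python
from fractions import Fraction
from math import comb


def q_values(z):
    """q_i = (n/m) * (z_i - m/n) for each bin, as exact fractions."""
    n, m = len(z), sum(z)
    return [Fraction(n, m) * (zi - Fraction(m, n)) for zi in z]


def pairs(z):
    """Number of pairs of balls that share a bin."""
    return sum(comb(zi, 2) for zi in z)


def pairs_from_q(z):
    """The same count, rewritten through the sum of the q_i^2."""
    n, m = len(z), sum(z)
    sum_q2 = sum(q * q for q in q_values(z))
    return Fraction(m * m, 2 * n * n) * sum_q2 + Fraction(m * m, 2 * n) - Fraction(m, 2)


def flat_statistic(z, planting):
    """F_A(Z) = sum_i a_i * q_i^2."""
    return sum(a * q * q for a, q in zip(planting, q_values(z)))
```

With the pure flat planting (all weights equal to 1), `flat_statistic` reduces to the sum of the $q_i^2$, so thresholding $F_A$ and thresholding PAIRS are the same test.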
4.0.2 Hilly Regime

In the Hilly regime we define the statistic

\[ H_A(Z) = \sum_{i=1}^n a_i q_i \]

with $q_i$ as in (4). If all k balls are planted in bin 1, $H_A(Z) = \frac{kn}{m} z_1 - k$, the number of balls in bin 1 scaled and centered, so $H_A$ is equivalent in this case to the natural statistic mentioned in the first paragraph of this section. Calculations in Section 8 give

\[ E_{ST}(H_A) = 0 \]

\[ E_{PL}(H_A) = \frac{Vn^2}{m} \]

\[ var_{ST}(H_A) \sim var_{PL}(H_A) \sim \frac{Vn^2}{m} \]

Notice that the difference in expectations is completely accounted for by the first k balls: if we let $\frac{n}{m}\left(a_i - \frac{k}{n}\right)$ be the contribution to $H_A$ of a ball that lands in bin i, then a randomly thrown ball contributes 0 to the expectation, while the k balls planted according to the initial planting A contribute $\frac{n}{m}\sum_i a_i\left(a_i - \frac{k}{n}\right) = \frac{Vn^2}{m}$. Also this choice of coefficients for a linear combination of the $q_i$'s maximizes the difference between the two means if we normalize by fixing the sum of the squares of the coefficients.

We set $\mu_H = \frac{Vn^2}{2m}$ to be the average of the two means, and our strategy is to choose the PL distribution if $H_A > \mu_H$ and the ST distribution otherwise. Here the distance of $\mu_H$ from each mean is $\frac{1}{2\sqrt{c}}$ standard deviations and $H_A$ is asymptotically normal under each distribution (Section 6). So calculating as above, $TV(m,n) \ge 2\Phi\left(\frac{1}{2\sqrt{c}}\right) - 1 + o(1)$.

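
The two means of the Hilly statistic $H_A = \sum_i a_i q_i$ follow from linearity of expectation alone, since $E[z_i] = m/n$ under ST and $E[z_i] = a_i + (m-k)/n$ under PL. The following sketch, with hypothetical helper names, confirms in exact arithmetic that the ST mean is 0 and the PL mean is $Vn^2/m$.

```python
from fractions import Fraction


def planting_variance(planting):
    """V(A) = (1/n) * sum a_i^2 - k^2 / n^2."""
    n, k = len(planting), sum(planting)
    return Fraction(sum(a * a for a in planting), n) - Fraction(k * k, n * n)


def mean_hilly_statistic(planting, m, planted):
    """E[H_A] for H_A = sum_i a_i q_i, with q_i = (n/m) * (z_i - m/n)."""
    n, k = len(planting), sum(planting)
    total = Fraction(0)
    for a in planting:
        # Expected number of balls in this bin under the chosen model.
        mean_z = a + Fraction(m - k, n) if planted else Fraction(m, n)
        total += a * Fraction(n, m) * (mean_z - Fraction(m, n))
    return total
```

For the planting (4, 1, 0, 0, 3) with m = 40, V = 66/25 and the PL mean is 33/20, exactly $Vn^2/m$.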
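4.0.3 Intermediate Regime

In the Intermediate regime our statistic $I_A(Z)$ is a mixture of the two statistics in the previous regimes:

\[ I_A(Z) = H_A(Z) - \frac{1}{2} F_A(Z) = \sum_{i=1}^n a_i q_i - \frac{1}{2}\sum_{i=1}^n a_i q_i^2 \]

Under the scaling $m \sim ckn^{1/2}$ with $V \sim \lambda k n^{-3/2}$ (that is, $m \sim cn^{3/2}$ and $V \sim \lambda n^{-1/2}$ when k = n), calculations give

\[ E_{ST}(I_A) = -\frac{kn}{2m} + o(1) \]

\[ E_{PL}(I_A) = -\frac{kn}{2m} + \frac{k^2 n}{2m^2} + \frac{Vn^2}{m} + o(1) \]

\[ var_{ST}(I_A) \sim var_{PL}(I_A) \sim \frac{Vn^2}{m} + \frac{k^2 n}{2m^2} \]

The average of the means is $\mu_I = -\frac{kn}{2m} + \frac{k^2 n}{4m^2} + \frac{Vn^2}{2m}$ and the distance from each mean is

\[ \frac{1}{2}\sqrt{\frac{\lambda}{c} + \frac{1}{2c^2}} \quad \text{standard deviations} \]

So we have

\[ TV(m,n) \ge 2\Phi\left(\sqrt{\frac{\lambda}{4c} + \frac{1}{8c^2}}\right) - 1 + o(1) \]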

5 Upper Bounds 

The statistics and strategies from the previous section give one way to distin- 
guish between the two distributions, but in principle there could be better 
methods for distinguishing them. In this section we show that the above 
strategies are asymptotically as good as the optimal strategy which consists 
of choosing the larger of ST(Z) and PL(Z). We state a Lemma: 

Lemma 1. Let $V = \frac{1}{n}\sum_i a_i^2 - \frac{k^2}{n^2}$ as in (2). Then

1. Flat Regime: Let $m \sim ckn^{1/2}$ and assume $V = o\left(\frac{k}{n^{3/2}}\right)$. Set $\mu_F = \frac{kn}{m} - \frac{k^2 n}{2m^2}$ as in Section 4.0.1. For any fixed ε > 0, with ST and PL probability 1 - o(1),

(a) $F_A(Z) > \mu_F + \epsilon \Rightarrow ST(Z) > PL(Z)$

(b) $F_A(Z) < \mu_F - \epsilon \Rightarrow PL(Z) > ST(Z)$

2. Hilly Regime: Let $m \sim cVn^2$ and assume $V = \omega\left(\frac{k}{n^{3/2}}\right)$. Set $\mu_H = \frac{Vn^2}{2m}$ as in Section 4.0.2. For any fixed ε > 0 the following hold with ST and PL probability 1 - o(1):

(a) $H_A(Z) > \mu_H + \epsilon \Rightarrow PL(Z) > ST(Z)$

(b) $H_A(Z) < \mu_H - \epsilon \Rightarrow ST(Z) > PL(Z)$

3. Intermediate Regime: Let $m \sim ckn^{1/2}$ and assume $V \sim \lambda \frac{k}{n^{3/2}}$. Set $\mu_I = -\frac{kn}{2m} + \frac{k^2 n}{4m^2} + \frac{Vn^2}{2m}$ as in Section 4.0.3. Then for any fixed ε > 0, with ST and PL probability 1 - o(1),

(a) $I_A(Z) > \mu_I + \epsilon \Rightarrow PL(Z) > ST(Z)$

(b) $I_A(Z) < \mu_I - \epsilon \Rightarrow ST(Z) > PL(Z)$

We will prove Lemma 1 in Section 5.1. The proof of Theorem 1 given Lemma 1 is very similar in all three cases. Here we prove it for the Hilly regime and omit the Flat and Intermediate cases.
Proof of Theorem 1, Hilly case. Fix some ε > 0. We partition the set of configurations as follows:

\[
\begin{aligned}
\Omega_1 &= \{Z : ST(Z) \ge PL(Z) \text{ and } H_A(Z) \le \mu_H - \epsilon\} \\
\Omega_2 &= \{Z : PL(Z) > ST(Z) \text{ and } H_A(Z) \ge \mu_H + \epsilon\} \\
\Omega_3 &= \{Z : ST(Z) \ge PL(Z) \text{ and } H_A(Z) \ge \mu_H + \epsilon\} \\
\Omega_4 &= \{Z : PL(Z) > ST(Z) \text{ and } H_A(Z) \le \mu_H - \epsilon\} \\
\Omega_5 &= \{Z : H_A(Z) \in (\mu_H - \epsilon, \mu_H + \epsilon)\}
\end{aligned}
\]

For n large, $ST(\Omega_5) < 2\sqrt{c}\,\epsilon$ and $PL(\Omega_5) < 2\sqrt{c}\,\epsilon$ since $H_A$ is asymptotically normal and the interval $(\mu_H - \epsilon, \mu_H + \epsilon)$ has width asymptotically $2\sqrt{c}\,\epsilon$ standard deviations under either the ST or PL distribution. Now recall the definition of total variation distance from equation (1). This is equivalent to

\[ \|ST - PL\|_{TV} = \sum_{Z : ST(Z) > PL(Z)} ST(Z) - PL(Z) \]

Define the similar quantity

\[ TV'(m,n) = \sum_{Z : H_A(Z) < \mu_H} ST(Z) - PL(Z) \]

which is what we would get as an estimate for the total variation distance if we used our strategy from the lower bound. We now show that TV(m, n) = TV'(m, n) + o(1).

\[ TV(m,n) \le \sum_{\Omega_1} [ST(Z) - PL(Z)] + \sum_{\Omega_3} ST(Z) + \sum_{\Omega_5} ST(Z) \]

and

\[ TV'(m,n) \ge \sum_{\Omega_1} [ST(Z) - PL(Z)] - \sum_{\Omega_4} PL(Z) - \sum_{\Omega_5} PL(Z) \]

So

\[ TV(m,n) \le TV'(m,n) + ST(\Omega_3) + PL(\Omega_4) + \sum_{\Omega_5} [ST(Z) + PL(Z)] \]

$ST(\Omega_3)$ is o(1) by Lemma 1 part 2a. $PL(\Omega_4)$ is o(1) by Lemma 1 part 2b. The sum over $\Omega_5$ is $< 4\sqrt{c}\,\epsilon$. So $TV(m,n) \le TV'(m,n) + 4\sqrt{c}\,\epsilon + o(1)$. Since ε is arbitrary, $TV(m,n) \le TV'(m,n) + o(1)$. The other side of the inequality is similar and so TV(m, n) = TV'(m, n) + o(1). □
5.1 Proof of Upper Bound Lemma

The idea of Lemma 1 is that with high probability over the choice of configuration, each of our simple lower bound strategies based on a threshold statistic is in fact equivalent to the optimal strategy. The optimal strategy is to compute $\frac{PL(Z)}{ST(Z)}$ and pick PL if the ratio is > 1. To prove the Lemma we compute the logarithm of the ratio, expand the terms and show that in each particular regime all the terms are concentrated except for one term which corresponds to the respective statistic. Then the sign of the logarithm is determined by whether the statistic is above or below its threshold.

Proof. The exact formulae for ST(Z) and PL(Z) are

\[ ST(Z) = \frac{1}{n^m}\binom{m}{z_1 \, \cdots \, z_n} \]

\[ PL(Z) = \frac{1}{n^{m-k}}\binom{m-k}{(z_1 - a_1) \, \cdots \, (z_n - a_n)} \]

and so

\[ \frac{PL(Z)}{ST(Z)} = \frac{n^k (z_1)_{a_1} \cdots (z_n)_{a_n}}{(m)_k} \tag{5} \]

where $(x)_a = x(x-1)\cdots(x-a+1)$ denotes the falling factorial. We write

\[ E_1 = \frac{m^k}{(m)_k}, \qquad E_2 = \prod_{i=1}^n \frac{(z_i)_{a_i}}{z_i^{a_i}} \]

and get

\[ \frac{PL(Z)}{ST(Z)} = E_1 E_2 \prod_{i=1}^n (1 + q_i)^{a_i} \]

where $q_i = \frac{n}{m}\left(z_i - \frac{m}{n}\right)$. In Section 7 we show that

\[ \ln(E_1 E_2) = -\frac{Vn^2}{2m} + \frac{kn}{2m} + \frac{5kn^2}{12m^2} - \frac{k^2 n}{4m^2} + o(1) \]

whp in each regime. Taking the logarithm of the ratio,

\[ \ln\frac{PL(Z)}{ST(Z)} = -\frac{Vn^2}{2m} + \frac{kn}{2m} + \frac{5kn^2}{12m^2} - \frac{k^2 n}{4m^2} + \sum_{i=1}^n a_i \log(1 + q_i) + o(1) \]

Chernoff bounds give that the $q_i$'s are uniformly o(1) whp in all three regimes, so we can use the Taylor series for the logarithm:

\[ \ln\frac{PL(Z)}{ST(Z)} = -\frac{Vn^2}{2m} + \frac{kn}{2m} + \frac{5kn^2}{12m^2} - \frac{k^2 n}{4m^2} + \sum_i a_i q_i - \frac{1}{2}\sum_i a_i q_i^2 + \frac{1}{3}\sum_i a_i q_i^3 - \frac{1}{4}\sum_i a_i q_i^4 + \cdots \]

where the higher order terms are o(1) from the Chernoff bound. Next we see from our calculations in Section 8 that the variances of $\sum a_i q_i^3$ and $\sum a_i q_i^4$ are o(1) in all regimes, and so they are concentrated around their means. This gives whp:

\[ \ln\frac{PL(Z)}{ST(Z)} = -\frac{Vn^2}{2m} + \frac{kn}{2m} - \frac{k^2 n}{4m^2} + \sum_i a_i q_i - \frac{1}{2}\sum_i a_i q_i^2 + o(1) \]

We now analyze this sum under the specifics of each regime.



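5.1.1 Flat Regime

Here $m \sim ckn^{1/2}$ and $V = o\left(\frac{k}{n^{3/2}}\right)$. Here the mean and variance of $\sum_i a_i q_i$ are o(1) for both the ST and PL distributions, and $\frac{Vn^2}{m} = o(1)$, so whp we have:

\[ \ln\frac{PL(Z)}{ST(Z)} = \frac{kn}{2m} - \frac{k^2 n}{4m^2} - \frac{1}{2}\sum_i a_i q_i^2 + o(1) \tag{7} \]

But $\sum_i a_i q_i^2$ is precisely our statistic $F_A(Z)$. By assumption in Lemma 1 part 1a), $F_A(Z) > \mu_F + \epsilon = \frac{kn}{m} - \frac{k^2 n}{2m^2} + \epsilon$, and $-\frac{1}{2}F_A(Z) < -\frac{kn}{2m} + \frac{k^2 n}{4m^2} - \frac{\epsilon}{2}$, so

\[ \ln\frac{PL(Z)}{ST(Z)} < -\frac{\epsilon}{2} + o(1) \]

and so whp, ST(Z) > PL(Z). Proving 1b) is similar: the assumption says that $F_A(Z) < \frac{kn}{m} - \frac{k^2 n}{2m^2} - \epsilon$, which gives $\ln\frac{PL(Z)}{ST(Z)} > \frac{\epsilon}{2} + o(1)$, and so PL(Z) > ST(Z) whp.

5.1.2 Hilly Regime

In the Hilly regime we have $V = \omega(kn^{-3/2})$ and $m \sim cVn^2$. In this regime $\frac{k^2 n}{m^2} = o(1)$ and $\sum_i a_i q_i^2$ is concentrated around its mean, giving simply:

\[ \ln\frac{PL(Z)}{ST(Z)} = -\frac{Vn^2}{2m} + \sum_i a_i q_i + o(1) \]

whp. $\sum_i a_i q_i$ is $H_A(Z)$. By assumption in part 2a) of Lemma 1, $H_A(Z) > \frac{Vn^2}{2m} + \epsilon$, so $\ln\frac{PL(Z)}{ST(Z)} > \epsilon + o(1)$ and PL(Z) > ST(Z) whp. Similarly under the assumptions of 2b) we get ST(Z) > PL(Z) whp.

5.1.3 Intermediate Regime

The intermediate regime is similar: $\sum_i a_i q_i - \frac{1}{2}\sum_i a_i q_i^2$ is the statistic $I_A(Z)$, and the conditions on $I_A(Z)$ imply the result. □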
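
The algebra at the start of this proof can be checked numerically. The sketch below evaluates $\ln(PL(Z)/ST(Z))$ in two ways, once from the falling-factorial ratio $n^k \prod_i (z_i)_{a_i} / (m)_k$ and once from the factorization $E_1 E_2 \prod_i (1+q_i)^{a_i}$, and the two values agree up to floating-point error. The function names are illustrative, and the code assumes $z_i \ge a_i$ in every bin so that PL(Z) > 0.

```python
from math import lgamma, log, log1p


def log_falling(x, a):
    """ln of the falling factorial (x)_a = x (x-1) ... (x-a+1); needs x >= a >= 0."""
    return lgamma(x + 1) - lgamma(x - a + 1)


def log_ratio_exact(z, planting):
    """ln(PL(Z)/ST(Z)) from the ratio n^k * prod (z_i)_{a_i} / (m)_k."""
    n, m, k = len(z), sum(z), sum(planting)
    numerator = sum(log_falling(zi, a) for zi, a in zip(z, planting))
    return k * log(n) + numerator - log_falling(m, k)


def log_ratio_factored(z, planting):
    """The same quantity as ln E_1 + ln E_2 + sum_i a_i * ln(1 + q_i)."""
    n, m, k = len(z), sum(z), sum(planting)
    ln_e1 = k * log(m) - log_falling(m, k)
    ln_e2 = 0.0
    ln_q = 0.0
    for zi, a in zip(z, planting):
        if a > 0:  # bins with no planted balls contribute nothing
            ln_e2 += log_falling(zi, a) - a * log(zi)
            ln_q += a * log1p(n * zi / m - 1.0)  # q_i = n * z_i / m - 1
    return ln_e1 + ln_e2 + ln_q
```

The sign of either function tells which distribution the optimal strategy selects for the configuration z.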






6 Asymptotic Normality 



6.1 Hilly Regime 

In the Hilly regime, our statistic $H_A(Z) = \sum_{i=1}^n a_i q_i$ under the ST distribution can be rewritten as $H_A(Z) = \sum_{j=1}^m Y_j$ where the $Y_j$'s are i.i.d., one for each ball, with $Y_j = \frac{n}{m}\left(a_i - \frac{k}{n}\right)$ if ball j is in bin i. To apply the Lindeberg-Feller Central Limit Theorem (see, for example, [6]), we check that the $Y_j$'s are bounded. $Y_j \ge -\frac{k}{m} = o(1)$, and

\[ Y_j \le \frac{nM}{m} \]

where $M = \max_i a_i$. $\sum_i a_i^2 = Vn + \frac{k^2}{n}$, so $M \le \sqrt{Vn + \frac{k^2}{n}}$ and

\[ Y_j \le \frac{\sqrt{k^2 n + Vn^3}}{cVn^2}(1 + o(1)) \le \left(\frac{k}{cn^{3/2}V} + \sqrt{\frac{1}{c^2 nV}}\right)(1 + o(1)) = o(1) \]

since we are in the Hilly regime and $k \gg \sqrt{n}$. The Lindeberg-Feller CLT then implies that $\frac{H_A}{\sqrt{var(H_A)}} \Rightarrow N(0,1)$.

Under the PL distribution we write $H_A(Z) = \frac{n}{m}\sum_i a_i\left(a_i - \frac{k}{n}\right) + \sum_{j=1}^{m-k} Y_j$ where the $Y_j$'s are as above and correspond to the m - k randomly placed balls. Again Lindeberg-Feller gives $\frac{H_A - E_{PL} H_A}{\sqrt{var(H_A)}} \Rightarrow N(0,1)$.

6.2 Flat and Intermediate Regime

In the Flat and Intermediate regimes there are not such simple representations of $F_A$ or $I_A$ as sums of independent random variables, but both statistics fit into a general framework of statistics of 'occupancy scores' that have been proved to have normal limits in a series of papers ([8], [9], [10]). The following Theorem from [8] suffices for our cases:

Theorem 2. Let $z_1, \dots, z_n$ be a multinomial vector with parameters $(p_1, \dots, p_n)$ and $\sum_i z_i = m$. Let $f_1, \dots, f_n$ be degree 2 polynomials and $S_n = \sum_{i=1}^n f_i(z_i)$. Suppose:

1. $\max_{1 \le j \le n} p_j = o(1)$

2. $\min_{1 \le j \le n} m p_j$ is bounded away from 0 as n → ∞

3. $\max_{1 \le j \le n} var(f_j(z_j)) / \sum_j var(f_j(z_j)) = o(1)$

Then $\frac{S_n - E S_n}{\sqrt{var(S_n)}} \Rightarrow N(0,1)$.

In our case $p_i = \frac{1}{n}$ for all i, and $\frac{m}{n}$ is bounded away from 0 under our scalings. Under the ST distribution we set $f_i(z_i) = \frac{n^2}{m^2} a_i \left(z_i - \frac{m}{n}\right)^2$, a polynomial of degree 2. The $z_i$'s have the same distribution and the $f_i$'s are the same except for the factor $a_i$, so $\frac{var(f_i(z_i))}{\sum_j var(f_j(z_j))} = \frac{a_i^2}{\sum_j a_j^2}$. In the Flat and Intermediate regime $\max_i a_i \le \frac{k}{n} + \sqrt{Vn}$, and $\sum_i a_i^2 \ge \frac{k^2}{n}$, so $\frac{\max_i a_i^2}{\sum_j a_j^2} = o(1)$ and Theorem 2 applies.

For the PL distribution we let $y_i$ be the number of the m - k random balls that end up in bin i. Here the $y_1, \dots, y_n$ are a multinomial vector with $p_i = \frac{1}{n}$ and $\sum_i y_i = m - k$. We let $f_i(y_i) = \frac{n^2}{m^2} a_i \left(a_i + y_i - \frac{m}{n}\right)^2$. As above, $\frac{var(f_i(y_i))}{\sum_j var(f_j(y_j))} = o(1)$ as long as $var\left(\left(a_i + y_i - \frac{m}{n}\right)^2\right) \sim var\left(\left(a_j + y_j - \frac{m}{n}\right)^2\right)$ for all i, j. This holds since $\max_i a_i = o\left(\frac{m}{n}\right)$ and both variances become $var\left(y^2 - 2\frac{m}{n}y(1 + o(1))\right)$.

7 Error Term 

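We will need the following proposition about the $a_i$'s:

Proposition 1. In all three regimes,

1. $\sum_i a_i^2 = \frac{k^2}{n} + nV$

2. $\sum_i a_i^3 = \frac{k^3}{n^2} + o\left(\frac{m^2}{n^2}\right)$

3. $\sum_i a_i^4 = \frac{k^4}{n^3} + o\left(\frac{m^3}{n^3}\right)$

Proof. 1) is the definition of V. For 2), we write $a_i = \frac{k}{n} + \Delta_i$. Then we have

\[ \sum_{i=1}^n a_i^2 = \sum_{i=1}^n \left(\frac{k^2}{n^2} + 2\frac{k}{n}\Delta_i + \Delta_i^2\right) = \frac{k^2}{n} + \sum_{i=1}^n \Delta_i^2 \]

where we use the fact that $\sum_i \Delta_i = 0$, so $\sum_i \Delta_i^2 = Vn$ from the definition of V. Now we write

\[ \sum_{i=1}^n a_i^3 = \sum_{i=1}^n \left(\frac{k^3}{n^3} + 3\frac{k^2}{n^2}\Delta_i + 3\frac{k}{n}\Delta_i^2 + \Delta_i^3\right) = \frac{k^3}{n^2} + 3kV + \sum_{i=1}^n \Delta_i^3 \]

It is straightforward to check that $kV = o\left(\frac{m^2}{n^2}\right)$ in all three regimes. We bound

\[ \sum_{i=1}^n \Delta_i^3 \le \left(\sum_{i=1}^n \Delta_i^2\right)^{3/2} = (Vn)^{3/2} \]

and similarly we can check that $(Vn)^{3/2} = o\left(\frac{m^2}{n^2}\right)$ in all three regimes.

For 3), we write

\[ \sum_{i=1}^n a_i^4 = \sum_{i=1}^n \left(\frac{k}{n} + \Delta_i\right)^4 = \frac{k^4}{n^3} + 6\frac{k^2 V}{n} + 4\frac{k}{n}\sum_{i=1}^n \Delta_i^3 + \sum_{i=1}^n \Delta_i^4 \]

Again we can check that $\frac{k^2 V}{n} = o\left(\frac{m^3}{n^3}\right)$ in all three regimes, and we bound the last two terms by $\frac{4k}{n}(Vn)^{3/2}$ and $(Vn)^2$ respectively, both of which are $o\left(\frac{m^3}{n^3}\right)$ in all three regimes. □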
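
The exact identities used in this proof, before any asymptotic bound is applied, are $\sum a_i^2 = k^2/n + nV$ and $\sum a_i^3 = k^3/n^2 + 3kV + \sum \Delta_i^3$ with $\Delta_i = a_i - k/n$. The sketch below checks them in rational arithmetic for an arbitrary planting; the function name `power_sum_identities` and the dictionary keys are illustrative choices.

```python
from fractions import Fraction


def power_sum_identities(planting):
    """Return both sides of the exact identities behind Proposition 1."""
    n, k = len(planting), sum(planting)
    mean = Fraction(k, n)
    deltas = [a - mean for a in planting]      # Delta_i = a_i - k/n
    v = sum(d * d for d in deltas) / n         # V(A), since sum Delta_i^2 = V n
    delta_cubes = sum(d ** 3 for d in deltas)
    return {
        "V": v,
        "sum_sq": sum(a ** 2 for a in planting),
        "sum_sq_formula": Fraction(k * k, n) + n * v,
        "sum_cube": sum(a ** 3 for a in planting),
        "sum_cube_formula": Fraction(k ** 3, n * n) + 3 * k * v + delta_cubes,
        "delta_cubes": delta_cubes,
    }
```

The remaining work in the proposition is purely asymptotic: showing that $3kV$ and $\sum \Delta_i^3$ are negligible against $m^2/n^2$ in each regime.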

Now we prove the main Lemma of this section: 

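Lemma 2. Let

\[ E_2 = \prod_{i=1}^n \frac{(z_i)_{a_i}}{z_i^{a_i}} \]

and

\[ E_1 = \frac{m^k}{(m)_k} \]

Then whp in all regimes,

\[ \ln(E_1 E_2) = -\frac{Vn^2}{2m} + \frac{kn}{2m} + \frac{5kn^2}{12m^2} - \frac{k^2 n}{4m^2} + o(1) \]

First we use a standard asymptotic approximation of $(m)_k$:

\[ \ln E_1 = \ln\frac{m^k}{(m)_k} = \frac{k^2}{2m} + \frac{k^3}{6m^2} + \cdots \]

Next we compute the asymptotics of $E_2$.

\[ E_2 = \prod_{i=1}^n \frac{(z_i)_{a_i}}{z_i^{a_i}} = \prod_{i=1}^n (1)\left(1 - \frac{1}{z_i}\right)\cdots\left(1 - \frac{a_i - 1}{z_i}\right) \]

and so

\[
\begin{aligned}
\ln E_2 &= \sum_{i=1}^n \sum_{j=0}^{a_i - 1} \ln\left(1 - \frac{j}{z_i}\right) \\
&= -\sum_{i=1}^n \left[\sum_{j=0}^{a_i-1} \frac{j}{z_i} + \sum_{j=0}^{a_i-1} \frac{j^2}{2z_i^2} + \sum_{j=0}^{a_i-1} \frac{j^3}{3z_i^3} + \cdots\right] \\
&= -\sum_{i=1}^n \left[\frac{1}{z_i}\frac{a_i^2 - a_i}{2} + \frac{1}{2z_i^2}\frac{a_i(a_i - 1)(2a_i - 1)}{6} + \frac{1}{3z_i^3}\frac{a_i^2(a_i - 1)^2}{4} + \cdots\right]
\end{aligned}
\]

Using long division we see that

\[ \frac{1}{z_i} = \frac{n}{m}\left(1 - q_i + q_i^2 - q_i^3 + \cdots\right) \]

and a Chernoff bound shows that whp $q_i = o(1)$ for all i. We expand the expression for $\ln E_2$ term by term:

\[
\begin{aligned}
-\sum_{i=1}^n \frac{1}{z_i}\frac{a_i^2 - a_i}{2} &= -\frac{n}{2m}\sum_{i=1}^n (a_i^2 - a_i)(1 - q_i + q_i^2 - \cdots) \\
&= -\frac{Vn^2}{2m} - \frac{k^2}{2m} + \frac{kn}{2m} + \frac{n}{2m}\sum_{i=1}^n (a_i^2 - a_i)(q_i - q_i^2 + \cdots)
\end{aligned}
\]

Using Proposition 1 and calculations from Section 8 we see that whp the $q_i^2$ term contributes $-\frac{k^2 n}{2m^2} + \frac{kn^2}{2m^2} + o(1)$, and the other terms in the sum are all o(1). This gives:

\[ -\sum_{i=1}^n \frac{1}{z_i}\frac{a_i^2 - a_i}{2} = -\frac{Vn^2}{2m} - \frac{k^2}{2m} + \frac{kn}{2m} - \frac{k^2 n}{2m^2} + \frac{kn^2}{2m^2} + o(1) \]

The next term is:

\[ -\sum_{i=1}^n \frac{1}{2z_i^2}\frac{a_i(a_i - 1)(2a_i - 1)}{6} = -\frac{n^2}{12m^2}\sum_{i=1}^n (2a_i^3 - 3a_i^2 + a_i)(1 - 2q_i + 3q_i^2 - \cdots) \]

Again using Proposition 1 we calculate that whp

\[ -\sum_{i=1}^n \frac{1}{2z_i^2}\frac{a_i(a_i - 1)(2a_i - 1)}{6} = -\frac{k^3}{6m^2} + \frac{k^2 n}{4m^2} - \frac{kn^2}{12m^2} + o(1) \]

For the remaining terms, only the leading terms will be $\Theta(1)$, and that only for large values of k. Those leading terms are precisely the negative of the remaining terms in the asymptotics $\ln E_1 = \frac{k^2}{2m} + \frac{k^3}{6m^2} + \cdots$. Putting this together gives the asymptotics of $\ln(E_1 E_2)$ and Lemma 2.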
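
The approximation $\ln E_1 = \ln\frac{m^k}{(m)_k} \approx \frac{k^2}{2m} + \frac{k^3}{6m^2}$ used at the start of this argument is easy to test numerically. The sketch below computes $\ln E_1$ exactly as $-\sum_{j<k}\ln(1 - j/m)$ and compares it with the two leading terms; the function names are illustrative. For k much smaller than m, the largest omitted correction is $-k/(2m)$, which is o(1).

```python
from math import log1p


def log_e1(m, k):
    """Exact ln(m^k / (m)_k), written as -sum_{j=0}^{k-1} ln(1 - j/m)."""
    return -sum(log1p(-j / m) for j in range(k))


def log_e1_leading(m, k):
    """The two leading terms k^2/(2m) + k^3/(6 m^2) of the expansion."""
    return k * k / (2 * m) + k ** 3 / (6 * m * m)
```

With m = 10^6 and k = 1000 the two values differ by about 5e-4, which is essentially k/(2m).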

8 Calculations 

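In this section we calculate the mean and variance of $\sum a_i q_i$, $\sum a_i q_i^2$, $\sum a_i q_i^3$, and $\sum a_i q_i^4$. The results:

Proposition 2. In each of our three regimes the following hold:

\[
\begin{aligned}
E_{ST} \sum a_i q_i &= 0, & var_{ST}\left(\sum a_i q_i\right) &= \frac{Vn^2}{m} \\
E_{PL} \sum a_i q_i &= \frac{Vn^2}{m}, & var_{PL}\left(\sum a_i q_i\right) &= \frac{Vn^2}{m} + o(1) \\
E_{ST} \sum a_i q_i^2 &= \frac{kn}{m} + o(1), & var_{ST}\left(\sum a_i q_i^2\right) &= \frac{2k^2 n}{m^2} + o(1) \\
E_{PL} \sum a_i q_i^2 &= \frac{kn}{m} - \frac{k^2 n}{m^2} + o(1), & var_{PL}\left(\sum a_i q_i^2\right) &= \frac{2k^2 n}{m^2} + o(1) \\
E_{ST} \sum a_i q_i^3 &= \frac{kn^2}{m^2} + o(1), & var_{ST}\left(\sum a_i q_i^3\right) &= o(1) \\
E_{PL} \sum a_i q_i^3 &= \frac{kn^2}{m^2} + o(1), & var_{PL}\left(\sum a_i q_i^3\right) &= o(1) \\
E_{ST} \sum a_i q_i^4 &= \frac{3kn^2}{m^2} + o(1), & var_{ST}\left(\sum a_i q_i^4\right) &= o(1) \\
E_{PL} \sum a_i q_i^4 &= \frac{3kn^2}{m^2} + o(1), & var_{PL}\left(\sum a_i q_i^4\right) &= o(1)
\end{aligned}
\]

We show the calculations for $\sum a_i q_i$ and $\sum a_i q_i^2$. The calculations for $\sum a_i q_i^3$ and $\sum a_i q_i^4$ are somewhat tedious and unenlightening and so are omitted.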

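8.1 $\sum a_i q_i$

For the ST distribution: We write $\sum_i a_i q_i = \sum_{j=1}^m Y_j$ where the $Y_j$'s are i.i.d. and $Y_j = \frac{n}{m}\left(a_i - \frac{k}{n}\right)$ if ball j is in bin i. From this we see $E_{ST}\sum a_i q_i = 0$ and $var_{ST}\left(\sum a_i q_i\right) = m \cdot var(Y_j) = m \cdot \frac{Vn^2}{m^2} = \frac{Vn^2}{m}$.

For the PL distribution, we write $\sum_i a_i q_i = \frac{n}{m}\sum_i a_i\left(a_i - \frac{k}{n}\right) + \sum_{j=1}^{m-k} Y_j = \frac{Vn^2}{m} + \sum_{j=1}^{m-k} Y_j$ where the $Y_j$'s are the same as above but indexed over the m - k randomly placed balls. This gives $E_{PL}\left(\sum a_i q_i\right) = \frac{Vn^2}{m}$ and $var_{PL}\left(\sum a_i q_i\right) = (m - k) \cdot var(Y_j) \sim \frac{Vn^2}{m}$.

8.2 $\sum a_i q_i^2$

For the ST distribution we write

\[ \sum_i a_i q_i^2 = k + \sum_{j<r} P_{j,r} + \sum_{l=1}^m Y_l \]

where $P_{j,r} = \frac{2n^2}{m^2} a_i$ if the pair of balls j, r are both in bin i, $P_{j,r} = 0$ if j, r are in different bins, and $Y_l = \left(\frac{n^2}{m^2} - \frac{2n}{m}\right) a_i$ if ball l is in bin i. So

\[ E \sum_i a_i q_i^2 = k + \binom{m}{2} E P_{j,r} + m E Y_l \]

We calculate that $E_{ST} P_{j,r} = \frac{2k}{m^2}$ and $E_{ST} Y_l = \frac{kn}{m^2} - \frac{2k}{m}$, giving

\[ E_{ST} \sum_i a_i q_i^2 = \frac{kn}{m} - \frac{k}{m} = \frac{kn}{m} + o(1) \]

And the variance is:

\[ var_{ST}\left(\sum a_i q_i^2\right) = \binom{m}{2} var(P_{j,r}) + m \, var(Y_l) + 6\binom{m}{3} cov(P_{j,r}, P_{r,l}) + 4\binom{m}{2} cov(P_{j,r}, Y_j) \]

since the $P_{j,r}$'s and $Y_l$'s that do not share an index are independent. We calculate

\[ var(P_{j,r}) = \sum_i \frac{4n^2}{m^4} a_i^2 - \frac{4k^2}{m^4} = \frac{4nk^2}{m^4} + \frac{4Vn^3}{m^4} - \frac{4k^2}{m^4} \]

\[ var(Y_l) = \sum_i \left(\frac{n^3}{m^4} + \frac{4n}{m^2} - \frac{4n^2}{m^3}\right) a_i^2 - \frac{k^2 n^2}{m^4} + \frac{4k^2 n}{m^3} - \frac{4k^2}{m^2} = \frac{Vn^4}{m^4} - \frac{4Vn^3}{m^3} + \frac{4Vn^2}{m^2} \]

\[ cov(P_{j,r}, P_{r,l}) = \sum_i \frac{4n}{m^4} a_i^2 - \frac{4k^2}{m^4} = \frac{4n}{m^4}\left(\frac{k^2}{n} + Vn\right) - \frac{4k^2}{m^4} = \frac{4Vn^2}{m^4} \]

\[ cov(P_{j,r}, Y_j) = \sum_i \left(\frac{2n^2}{m^4} - \frac{4n}{m^3}\right) a_i^2 - \frac{2k^2 n}{m^4} + \frac{4k^2}{m^3} = \frac{2Vn^3}{m^4} - \frac{4Vn^2}{m^3} \]

which all together gives

\[ var_{ST}\left(\sum a_i q_i^2\right) = \frac{2k^2 n}{m^2} + o(1) \]

in all our regimes as long as $k \gg \sqrt{n}$.

And for the PL distribution:

\[ \sum_i a_i q_i^2 = \frac{n^2}{m^2}\sum_i a_i\left(a_i + r_i - \frac{m}{n}\right)^2 \]

where $r_i$ is the number of randomly thrown balls in bin i, so $z_i = a_i + r_i$. Expanding we find the constant part is:

\[ C = \frac{n^2}{m^2}\sum_i a_i\left(a_i^2 + \frac{m^2}{n^2} - 2a_i\frac{m}{n}\right) = \frac{k^3}{m^2} + k - \frac{2k^2}{m} - \frac{2Vn^2}{m} + o(1) \]

since in all regimes $\frac{n^2}{m^2}\sum_i a_i^3 = \frac{k^3}{m^2} + o(1)$ (see Proposition 1). The random part is:

\[ \frac{n^2}{m^2}\sum_i a_i r_i^2 + \frac{n^2}{m^2}\sum_i r_i\left(2a_i^2 - 2a_i\frac{m}{n}\right) \]

If we let $R_i$ be the number of pairs of random balls in bin i, then $r_i^2 = 2R_i + r_i$ and the random part becomes:

\[ \frac{2n^2}{m^2}\sum_i a_i R_i + \frac{n^2}{m^2}\sum_i r_i\left(2a_i^2 - 2a_i\frac{m}{n} + a_i\right) \]

So we define $P_{j,r}$ to be $\frac{2n^2}{m^2} a_i$ when the pair of random balls j, r are both in bin i, and $Y_l = \frac{n^2}{m^2}\left(2a_i^2 - 2a_i\frac{m}{n} + a_i\right)$ when ball l is in bin i, and then

\[ \sum_i a_i q_i^2 = C + \sum_{j<r} P_{j,r} + \sum_{l=1}^{m-k} Y_l \]

We have:

\[ E Y_l = \frac{2k^2}{m^2} + \frac{kn}{m^2} + \frac{2Vn^2}{m^2} - \frac{2k}{m} \]

\[ E P_{j,r} = \frac{2k}{m^2} \]

and so

\[
\begin{aligned}
E_{PL} \sum_i a_i q_i^2 &= \frac{k^3}{m^2} + k - \frac{2k^2}{m} - \frac{2Vn^2}{m} + (m - k)\left(\frac{2k^2}{m^2} + \frac{kn}{m^2} + \frac{2Vn^2}{m^2} - \frac{2k}{m}\right) + \binom{m-k}{2}\frac{2k}{m^2} + o(1) \\
&= \frac{kn}{m} - \frac{k^2 n}{m^2} + o(1)
\end{aligned}
\]

since the leftover terms, such as $\frac{kVn^2}{m^2}$, are o(1) in all regimes.

Variance: The $P_{j,r}$'s are the same for the PL distribution as for the ST distribution and for the $Y_l$'s we calculate

\[ var(Y_l) = \frac{4Vn^2}{m^2} + o(m^{-1}) \]

\[ cov(P_{j,r}, Y_j) = -\frac{4Vn^2}{m^3} + o(m^{-2}) \]

which gives

\[ var_{PL}\left(\sum a_i q_i^2\right) = \frac{2k^2 n}{m^2} + o(1) \]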

9 Concluding Remarks 

We discuss briefly a natural modification of this problem suitable for further 
study. 



Unlabeled Bins. Suppose the bins are now indistinguishable. It becomes more difficult now to distinguish the two distributions (since we could have ignored the labels in the previous case), so the correct scaling for m will be no greater than in the original labeled case. In the case of planting exactly one ball in each bin, the scaling is actually the same: as mentioned above, the statistic $\sum_i q_i^2$ is an asymptotically correct statistic in this regime, and since it does not depend on the labeling of the bins, we can use it just as is in the unlabeled case. The Hilly regime however is much different. There our distinguishing statistic, $\sum_i a_i q_i$, depended very much on the labeling of the bins. In the case of n balls planted in one bin and none in the rest, a natural guess of the right statistic would be the maximum number of balls in a bin. This turns out to be correct and it comes with a different scaling for m, $m \sim \frac{n^3}{2\log n}\left(1 + \frac{c}{\sqrt{\log n}}\right)$ with $c \in (-\infty, \infty)$ a constant. The distribution of the maximum number of balls in a bin, studied in [12], is very useful in this case. The general case seems to be related asymptotically to the problem of distinguishing between two scenarios: one in which we have n independent N(0, 1) random variables and one in which we have a collection of normal random variables with means different from 0. Such a problem was studied in a different context in [3].